1 Introduction

Multiple previous studies have shown that members of the family of nuclear factor kB (NF-kB) transcription factors can play very important roles in mulitple cellular processes (immune response, cell proliferation, cell survival, etc.). Activation of canonical NF-kB signaling pathways are fairly well studied, but there is a lack of information regarding the activation of noncanonical NF-kB signaling pathways. Using high-content phenotypic screening, Henry et al. (2018) were able to identify cyclin-dependent kinase 12 (CDK12) as a regulator of this pathway. Further, chemoproteomics identified the chemical compound 919278 which targets CDK12, specifically functioning as an inhibitor of CDK12. To better understand global trancsriptional changes, the authors sequenced the transcriptome of a human osteosarcoma cell line (U2OS) and treated the cells in a range of different experimental conditions (stimulation (TWEAK or unstimulated), time of treatment/stimulation (24hr or 4hr), and treatments (BIO0919278, BIO0702697, BIO032202, DMSO, NTsiRNA, siRNAs523626, siRNAs523629)).

It is important to note that BIO0702697 is the S-enantiomer of BIO0919278. Using high-throughput screening techniques, the authors found that BIO0702697 was significantly less potent of a compound relative to BIO0919278 relative to activating the noncanonical NF-kB signaling pathway. The main goal of the differential expression analysis in this data analysis is to analyze differential gene expression between the two enantiomer treatments (BIO0702697 and BIO0919278). This type of analysis would help us determine the extent to which stereochemistry of these molecules affects pathways related to noncanonical NF-kB signaling. It will ideally help us validate and provide further evidence for the results in the original paper.

The resulting raw RNA-seq data have been deposited at the Gene Expression Omnibus (GEO), where they are publicly available under accession GSE113926. Here, we show a first exploratory analysis of the corresponding RNA-seq gene expression profiles generated as a table of counts using the DEE2 (http://dee2.io) pipeline by Ziemann, Kaspi, and El-Osta (2019), and further packaged into a SummarizedExperiment object with genes mapped to Entrez identifiers. This object also stores the phenotypic information about the profiled samples that has been also made available at GEO.

2 Quality Assessment

2.1 Data Import and Cleaning

We start importing the raw table of counts.

library(SummarizedExperiment)

se <- readRDS(usethis::proj_path(file.path("data", "data.rds")))
se
class: RangedSummarizedExperiment 
dim: 25122 44 
metadata(4): experimentData annotation ensemblVersion urlProcessedData
assays(1): counts
rownames(25122): 1 10 ... 9994 9997
rowData names(5): gene_id gene_biotype description gene_id_version
  symbol
colnames(44): SRR7089369 SRR7089370 ... SRR7089423 SRR7089424
colData names(44): title geo_accession ... time:ch1 treatment:ch1

We have 25122 genes by 44 samples. From the first row and column names shown by the object, we can figure out that genes are defined by Entrez (Maglott et al. 2010) identifiers and samples by Sequence Read Archive Run ([SRR] (https://www.ncbi.nlm.nih.gov/books/NBK56913/#search.what_do_the_different_sra_accessi)) identifiers.

The row data in this object contains information about the profiled genes.

head(rowData(se))
DataFrame with 6 rows and 5 columns
                  gene_id   gene_biotype
              <character>    <character>
1         ENSG00000121410 protein_coding
10        ENSG00000156006 protein_coding
100       ENSG00000196839 protein_coding
1000      ENSG00000170558 protein_coding
10000     ENSG00000117020 protein_coding
100008586 ENSG00000236362 protein_coding
                                                              description
                                                              <character>
1                  alpha-1-B glycoprotein [Source:HGNC Symbol;Acc:HGNC:5]
10               N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]
100                 adenosine deaminase [Source:HGNC Symbol;Acc:HGNC:186]
1000                        cadherin 2 [Source:HGNC Symbol;Acc:HGNC:1759]
10000     AKT serine/threonine kinase 3 [Source:HGNC Symbol;Acc:HGNC:393]
100008586               G antigen 12F [Source:HGNC Symbol;Acc:HGNC:31906]
             gene_id_version      symbol
                 <character> <character>
1         ENSG00000121410.11        A1BG
10         ENSG00000156006.4        NAT2
100       ENSG00000196839.12         ADA
1000       ENSG00000170558.8        CDH2
10000     ENSG00000117020.16        AKT3
100008586  ENSG00000236362.8     GAGE12F

Among this information, the gene symbol and description are potentially useful for interpreting results of, for instance, a differential expression analysis. Let’s explore now the column (phenotypic) data.

dim(colData(se))
[1] 44 44
head(colData(se), n=3)
DataFrame with 3 rows and 44 columns
                             title geo_accession                status
                          <factor>   <character>              <factor>
SRR7089369 UnStim-DMSO-4hr-A          GSM3123750 Public on Jul 31 2018
SRR7089370 TWEAK-DMSO-4hr-A           GSM3123751 Public on Jul 31 2018
SRR7089373 UnStim-BIO0919278-4hr-A    GSM3123754 Public on Jul 31 2018
           submission_date last_update_date     type channel_count
                  <factor>         <factor> <factor>      <factor>
SRR7089369     May 01 2018      Jul 31 2018      SRA             1
SRR7089370     May 01 2018      Jul 31 2018      SRA             1
SRR7089373     May 01 2018      Jul 31 2018      SRA             1
           source_name_ch1 organism_ch1 characteristics_ch1
                  <factor>     <factor>            <factor>
SRR7089369       U2OS cell Homo sapiens stimulation: UnStim
SRR7089370       U2OS cell Homo sapiens stimulation: TWEAK 
SRR7089373       U2OS cell Homo sapiens stimulation: UnStim
           characteristics_ch1.1 characteristics_ch1.2 characteristics_ch1.3
                        <factor>              <factor>              <factor>
SRR7089369 treatment: DMSO                   time: 4hr       cell line: U2OS
SRR7089370 treatment: DMSO                   time: 4hr       cell line: U2OS
SRR7089373 treatment: BIO0919278             time: 4hr       cell line: U2OS
           molecule_ch1
               <factor>
SRR7089369    polyA RNA
SRR7089370    polyA RNA
SRR7089373    polyA RNA
                                                                   extract_protocol_ch1
                                                                               <factor>
SRR7089369 RNA libraries were prepared for sequencing using standard Illumina protocols
SRR7089370 RNA libraries were prepared for sequencing using standard Illumina protocols
SRR7089373 RNA libraries were prepared for sequencing using standard Illumina protocols
           taxid_ch1 description
            <factor>    <factor>
SRR7089369      9606 replicate A
SRR7089370      9606 replicate A
SRR7089373      9606 replicate A
                                                            data_processing
                                                                   <factor>
SRR7089369 Reads were aligned to reference genome (hg19) using STAR aligner
SRR7089370 Reads were aligned to reference genome (hg19) using STAR aligner
SRR7089373 Reads were aligned to reference genome (hg19) using STAR aligner
                                                                                     data_processing.1
                                                                                              <factor>
SRR7089369 QC includes sequence quality, GC content, 5’-3’ gene body coverage (Supplementary Table S6)
SRR7089370 QC includes sequence quality, GC content, 5’-3’ gene body coverage (Supplementary Table S6)
SRR7089373 QC includes sequence quality, GC content, 5’-3’ gene body coverage (Supplementary Table S6)
                                                                                                                                                                                                                                                                             data_processing.2
                                                                                                                                                                                                                                                                                      <factor>
SRR7089369 Outlier detection absolute Z score > 2 was applied on overall sequencing quality score, 5 prime coverage, 3 prime coverage, mean_GC content, duplication rate, mean_ and mapped percentage. Sample with absolute Z score > 2 would be discarded, which did not apply to this study.
SRR7089370 Outlier detection absolute Z score > 2 was applied on overall sequencing quality score, 5 prime coverage, 3 prime coverage, mean_GC content, duplication rate, mean_ and mapped percentage. Sample with absolute Z score > 2 would be discarded, which did not apply to this study.
SRR7089373 Outlier detection absolute Z score > 2 was applied on overall sequencing quality score, 5 prime coverage, 3 prime coverage, mean_GC content, duplication rate, mean_ and mapped percentage. Sample with absolute Z score > 2 would be discarded, which did not apply to this study.
                                                                                                                   data_processing.3
                                                                                                                            <factor>
SRR7089369 Aligned reads were counted against gene model annotation (Gencode v18) to obtain expression values by using FeatureCounts
SRR7089370 Aligned reads were counted against gene model annotation (Gencode v18) to obtain expression values by using FeatureCounts
SRR7089373 Aligned reads were counted against gene model annotation (Gencode v18) to obtain expression values by using FeatureCounts
                                           data_processing.4  data_processing.5
                                                    <factor>           <factor>
SRR7089369 DESeq2 was used for gene expression normalization Genome_build: hg19
SRR7089370 DESeq2 was used for gene expression normalization Genome_build: hg19
SRR7089373 DESeq2 was used for gene expression normalization Genome_build: hg19
                                                                                               data_processing.6
                                                                                                        <factor>
SRR7089369 Supplementary_files_format_and_content: Tab-delimited text files include RPKM values for each Sample:
SRR7089370 Supplementary_files_format_and_content: Tab-delimited text files include RPKM values for each Sample:
SRR7089373 Supplementary_files_format_and_content: Tab-delimited text files include RPKM values for each Sample:
                                                                                                        data_processing.7
                                                                                                                 <factor>
SRR7089369 DESeq2_normalized_count_matrix.txt: Count matrix for counts per gene, normalized for library size using DESeq2
SRR7089370 DESeq2_normalized_count_matrix.txt: Count matrix for counts per gene, normalized for library size using DESeq2
SRR7089373 DESeq2_normalized_count_matrix.txt: Count matrix for counts per gene, normalized for library size using DESeq2
                                                                                                                                                                                                                                                                                                                                                     data_processing.8
                                                                                                                                                                                                                                                                                                                                                              <factor>
SRR7089369 DESeq2_regularized_log_transformed.txt: Made using the rlogTransformation function in DESeq2. This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. These are the values used to obtain clustering and PCA results.
SRR7089370 DESeq2_regularized_log_transformed.txt: Made using the rlogTransformation function in DESeq2. This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. These are the values used to obtain clustering and PCA results.
SRR7089373 DESeq2_regularized_log_transformed.txt: Made using the rlogTransformation function in DESeq2. This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. These are the values used to obtain clustering and PCA results.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           data_processing.9
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <factor>
SRR7089369 QC_statistics_to_deliver.txt: Tot_reads(M)- Total number of reads in the sample, in millions  rRNA(%)- Percent of reads aligned to ribosomal RNA  Map(%)- Percent of reads which map to genome  UQ_map(%)- Percent of reads which map to only one locus on the genome  Gene_assn(%)- Of uniquely mapped reads, percentage which may be assigned to exons of genes by gene counting program featureCounts  Strand(%)- Percentage of reads which are reversely stranded. If unstranded protocol, this is not counted and listed as 0.  5prime_cov- Mean normalized coverage from Picard for percentiles 11-30 along gene body  3prime_cov- Mean normalized coverage from Picard for percentiles 71-90 along gene body  Mean_GC(%)- Mean GC content averaging GC content for each read across all reads  Dup(%)- Percentage of reads which were duplicates  Mean_inner_dist- Inner distance = Read 2 start - Read 1 end. Negative indicates overlap.  CDS, UTR, Intronic, and Intergenic(%)- Percentage of reads assigned to different regions of the genome. These four add up to 100%.  chrM(%)- For reads assigned to genes, percentage assigned to mitochondrial genes.  Top_ten(%)- For reads assigned to genes, percentage assigned to the top 10 most highly expressed genes.
SRR7089370 QC_statistics_to_deliver.txt: Tot_reads(M)- Total number of reads in the sample, in millions  rRNA(%)- Percent of reads aligned to ribosomal RNA  Map(%)- Percent of reads which map to genome  UQ_map(%)- Percent of reads which map to only one locus on the genome  Gene_assn(%)- Of uniquely mapped reads, percentage which may be assigned to exons of genes by gene counting program featureCounts  Strand(%)- Percentage of reads which are reversely stranded. If unstranded protocol, this is not counted and listed as 0.  5prime_cov- Mean normalized coverage from Picard for percentiles 11-30 along gene body  3prime_cov- Mean normalized coverage from Picard for percentiles 71-90 along gene body  Mean_GC(%)- Mean GC content averaging GC content for each read across all reads  Dup(%)- Percentage of reads which were duplicates  Mean_inner_dist- Inner distance = Read 2 start - Read 1 end. Negative indicates overlap.  CDS, UTR, Intronic, and Intergenic(%)- Percentage of reads assigned to different regions of the genome. These four add up to 100%.  chrM(%)- For reads assigned to genes, percentage assigned to mitochondrial genes.  Top_ten(%)- For reads assigned to genes, percentage assigned to the top 10 most highly expressed genes.
SRR7089373 QC_statistics_to_deliver.txt: Tot_reads(M)- Total number of reads in the sample, in millions  rRNA(%)- Percent of reads aligned to ribosomal RNA  Map(%)- Percent of reads which map to genome  UQ_map(%)- Percent of reads which map to only one locus on the genome  Gene_assn(%)- Of uniquely mapped reads, percentage which may be assigned to exons of genes by gene counting program featureCounts  Strand(%)- Percentage of reads which are reversely stranded. If unstranded protocol, this is not counted and listed as 0.  5prime_cov- Mean normalized coverage from Picard for percentiles 11-30 along gene body  3prime_cov- Mean normalized coverage from Picard for percentiles 71-90 along gene body  Mean_GC(%)- Mean GC content averaging GC content for each read across all reads  Dup(%)- Percentage of reads which were duplicates  Mean_inner_dist- Inner distance = Read 2 start - Read 1 end. Negative indicates overlap.  CDS, UTR, Intronic, and Intergenic(%)- Percentage of reads assigned to different regions of the genome. These four add up to 100%.  chrM(%)- For reads assigned to genes, percentage assigned to mitochondrial genes.  Top_ten(%)- For reads assigned to genes, percentage assigned to the top 10 most highly expressed genes.
                                                                                           data_processing.10
                                                                                                     <factor>
SRR7089369 featureCounts_count_matrix.txt: Raw count matrix with counts per gene as obtained by featureCounts
SRR7089370 featureCounts_count_matrix.txt: Raw count matrix with counts per gene as obtained by featureCounts
SRR7089373 featureCounts_count_matrix.txt: Raw count matrix with counts per gene as obtained by featureCounts
                                                                                                                               data_processing.11
                                                                                                                                         <factor>
SRR7089369 gene_annotation_info.txt: More info on the genes in the count matrices, such as gene coordinates, biotype (coding vs. noncoding), etc.
SRR7089370 gene_annotation_info.txt: More info on the genes in the count matrices, such as gene coordinates, biotype (coding vs. noncoding), etc.
SRR7089373 gene_annotation_info.txt: More info on the genes in the count matrices, such as gene coordinates, biotype (coding vs. noncoding), etc.
                                                                                                                                                                                       data_processing.12
                                                                                                                                                                                                 <factor>
SRR7089369 kallisto_TPM_matrix.txt: Per transcript in the annotation, the TPM (transcripts per million) value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
SRR7089370 kallisto_TPM_matrix.txt: Per transcript in the annotation, the TPM (transcripts per million) value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
SRR7089373 kallisto_TPM_matrix.txt: Per transcript in the annotation, the TPM (transcripts per million) value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
                                                                                                                                                                                data_processing.13
                                                                                                                                                                                          <factor>
SRR7089369 kallisto_est_count_matrix.txt: Per transcript in the annotation, the estimated counts value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
SRR7089370 kallisto_est_count_matrix.txt: Per transcript in the annotation, the estimated counts value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
SRR7089373 kallisto_est_count_matrix.txt: Per transcript in the annotation, the estimated counts value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
           platform_id data_row_count    instrument_model library_selection
              <factor>       <factor>            <factor>          <factor>
SRR7089369    GPL16791              0 Illumina HiSeq 2500              cDNA
SRR7089370    GPL16791              0 Illumina HiSeq 2500              cDNA
SRR7089373    GPL16791              0 Illumina HiSeq 2500              cDNA
           library_source library_strategy
                 <factor>         <factor>
SRR7089369 transcriptomic          RNA-Seq
SRR7089370 transcriptomic          RNA-Seq
SRR7089373 transcriptomic          RNA-Seq
                                                                 relation
                                                                 <factor>
SRR7089369 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN09007224
SRR7089370 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN09007223
SRR7089373 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN09007220
                                                      relation.1
                                                        <factor>
SRR7089369 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX4018354
SRR7089370 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX4018355
SRR7089373 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX4018358
           supplementary_file_1 cell line:ch1 stimulation:ch1    time:ch1
                       <factor>   <character>     <character> <character>
SRR7089369                 NONE          U2OS          UnStim         4hr
SRR7089370                 NONE          U2OS           TWEAK         4hr
SRR7089373                 NONE          U2OS          UnStim         4hr
           treatment:ch1
             <character>
SRR7089369          DMSO
SRR7089370          DMSO
SRR7089373    BIO0919278

We have a total of 44 phenotypic variables. The second column geo_accession contains GEO Sample Accession Number (GSM) identifers. GSM identifiers define individual samples, understood in this context as individual sources of RNA. If the GSM identifiers are repeated, this indicates that among the 44 samples there are technical replicates. We can figure out how many technical replicates per GSM sample we have as follows:

length(unique(se$geo_accession))
[1] 44
table(lengths(split(colnames(se), se$geo_accession)))

 1 
44 

This has shown us that there are no technical replicates among the 44 samples (all GSM identifiers are unique).

To proceed further exploring this dataset, we are going to use the edgeR package and build a DGEList object, incorporating the gene metadata, which includes the gene symbol.

library(edgeR)

dge <- DGEList(counts=assays(se)$counts, genes=rowData(se))
dim(dge)
[1] 25122    44

We will now calculate the \(\log_2\) CPM units of expression and put them as an additional assay element to ease their manipulation. CPM is a form of within-sample scaling using counts per million reads (CPM) mapped to the genome.

assays(se)$logCPM <- cpm(dge, log=TRUE)
assays(se)$logCPM[1:5, 1:5]
      SRR7089369 SRR7089370 SRR7089373 SRR7089374 SRR7089375
1      -1.168200 -0.5456482 -0.2786859  -1.523985  -1.334666
10     -2.234508 -1.7036348 -2.2345084  -2.234508  -2.234508
100     3.904497  4.0985553  4.4986419   4.357345   4.027829
1000    7.875925  7.8886217  7.9733523   8.040212   7.904940
10000   7.413678  7.4171096  7.2886915   7.341795   7.326141

Let’s explore now some of the phenotypic variables. Based on the metadata that is provided, we know what information is stored in each variable. After some visual inspection, we can see that the variables characteristics_ch1, characteristics_ch1.1, characteristics_ch1.2,characteristics_ch1.3 contain the information about stimulation, treatment, time, and cell line, respectively. These variables represent the relevant experimental conditions

table(se$characteristics_ch1) #stimulation

 stimulation: TWEAK stimulation: UnStim 
                 23                  21 
table(se$characteristics_ch1.1) #treatment

   treatment: BIO032202   treatment: BIO0702697   treatment: BIO0919278 
                      7                       5                       7 
        treatment: DMSO      treatment: NTsiRNA treatment: siRNAs523626 
                      5                       8                       5 
treatment: siRNAs523629 
                      7 
table(se$characteristics_ch1.2) #time

 time: 24h time: 24hr  time: 4hr 
         0         22         22 
table(se$characteristics_ch1.3) #cell line

cell line: U2OS 
             44 

To facilitate the handling of these variables we are going to recode them as follows:

se$stimulation <- se$characteristics_ch1
se$treatment <- se$characteristics_ch1.1
se$time <- se$characteristics_ch1.2
se$cell_line <- se$characteristics_ch1.3

We can also identify some variables associated with technical factors, such as the sample preparation protocol in extract_protocol_ch1.

table(se$extract_protocol_ch1)

RNA libraries were prepared for sequencing using standard Illumina protocols 
                                                                          44 

We can see that all samples were prepared using the same protocol. Because of there, there is unlikely to be any sort of variation in the data due to technical protocols.

Finally, we also observe that the variable description contains some relevant information about an apparent sub-grouping of the samples.

se$description
         V2          V3          V6          V7          V8          V9 
replicate A replicate A replicate A replicate A replicate A replicate A 
        V10         V11         V13         V14         V15         V17 
replicate A replicate A replicate A replicate A replicate A replicate A 
        V20         V21         V23         V24         V25         V26 
replicate B replicate B replicate B replicate B replicate B replicate B 
        V28         V29         V30         V31         V32         V33 
replicate B replicate B replicate B replicate B replicate B replicate B 
        V34         V35         V36         V37         V38         V39 
replicate A replicate A replicate A replicate A replicate A replicate A 
        V40         V41         V42         V44         V46         V47 
replicate A replicate A replicate A replicate A replicate B replicate B 
        V49         V50         V51         V52         V53         V55 
replicate B replicate B replicate B replicate B replicate B replicate B 
        V56         V57 
replicate B replicate B 
Levels: replicate A replicate B

We can see that there are biological replicates for each of the samples. Later on the analysis, we will determine whether these biological replicates may contribute to confounding batch effects.

In Table 1 below, we show a list of all samples and their corresponding experimental conditions in order to gain as much of an understaning as possible about the underlying experimental design.

Table 1: Phenotypic variables.
Identifer Cell line Time Stimulation Treatment Replicate
V2 SRR7089369 cell line: U2OS time: 4hr stimulation: UnStim treatment: DMSO replicate A
V3 SRR7089370 cell line: U2OS time: 4hr stimulation: TWEAK treatment: DMSO replicate A
V6 SRR7089373 cell line: U2OS time: 4hr stimulation: UnStim treatment: BIO0919278 replicate A
V7 SRR7089374 cell line: U2OS time: 4hr stimulation: TWEAK treatment: BIO0919278 replicate A
V8 SRR7089375 cell line: U2OS time: 4hr stimulation: UnStim treatment: BIO032202 replicate A
V9 SRR7089376 cell line: U2OS time: 4hr stimulation: TWEAK treatment: BIO032202 replicate A
V10 SRR7089377 cell line: U2OS time: 24hr stimulation: UnStim treatment: DMSO replicate A
V11 SRR7089378 cell line: U2OS time: 24hr stimulation: TWEAK treatment: DMSO replicate A
V13 SRR7089380 cell line: U2OS time: 24hr stimulation: TWEAK treatment: BIO0702697 replicate A
V14 SRR7089381 cell line: U2OS time: 24hr stimulation: UnStim treatment: BIO0919278 replicate A
V15 SRR7089382 cell line: U2OS time: 24hr stimulation: TWEAK treatment: BIO0919278 replicate A
V17 SRR7089384 cell line: U2OS time: 24hr stimulation: TWEAK treatment: BIO032202 replicate A
V20 SRR7089387 cell line: U2OS time: 4hr stimulation: UnStim treatment: BIO0702697 replicate B
V21 SRR7089388 cell line: U2OS time: 4hr stimulation: TWEAK treatment: BIO0702697 replicate B
V23 SRR7089390 cell line: U2OS time: 4hr stimulation: TWEAK treatment: BIO0919278 replicate B
V24 SRR7089391 cell line: U2OS time: 4hr stimulation: UnStim treatment: BIO032202 replicate B
V25 SRR7089392 cell line: U2OS time: 4hr stimulation: TWEAK treatment: BIO032202 replicate B
V26 SRR7089393 cell line: U2OS time: 24hr stimulation: UnStim treatment: DMSO replicate B
V28 SRR7089395 cell line: U2OS time: 24hr stimulation: UnStim treatment: BIO0702697 replicate B
V29 SRR7089396 cell line: U2OS time: 24hr stimulation: TWEAK treatment: BIO0702697 replicate B
V30 SRR7089397 cell line: U2OS time: 24hr stimulation: UnStim treatment: BIO0919278 replicate B
V31 SRR7089398 cell line: U2OS time: 24hr stimulation: TWEAK treatment: BIO0919278 replicate B
V32 SRR7089399 cell line: U2OS time: 24hr stimulation: UnStim treatment: BIO032202 replicate B
V33 SRR7089400 cell line: U2OS time: 24hr stimulation: TWEAK treatment: BIO032202 replicate B
V34 SRR7089401 cell line: U2OS time: 4hr stimulation: UnStim treatment: NTsiRNA replicate A
V35 SRR7089402 cell line: U2OS time: 4hr stimulation: TWEAK treatment: NTsiRNA replicate A
V36 SRR7089403 cell line: U2OS time: 4hr stimulation: UnStim treatment: siRNAs523626 replicate A
V37 SRR7089404 cell line: U2OS time: 4hr stimulation: TWEAK treatment: siRNAs523626 replicate A
V38 SRR7089405 cell line: U2OS time: 4hr stimulation: UnStim treatment: siRNAs523629 replicate A
V39 SRR7089406 cell line: U2OS time: 4hr stimulation: TWEAK treatment: siRNAs523629 replicate A
V40 SRR7089407 cell line: U2OS time: 24hr stimulation: UnStim treatment: NTsiRNA replicate A
V41 SRR7089408 cell line: U2OS time: 24hr stimulation: TWEAK treatment: NTsiRNA replicate A
V42 SRR7089409 cell line: U2OS time: 24hr stimulation: UnStim treatment: siRNAs523626 replicate A
V44 SRR7089411 cell line: U2OS time: 24hr stimulation: UnStim treatment: siRNAs523629 replicate A
V46 SRR7089413 cell line: U2OS time: 4hr stimulation: UnStim treatment: NTsiRNA replicate B
V47 SRR7089414 cell line: U2OS time: 4hr stimulation: TWEAK treatment: NTsiRNA replicate B
V49 SRR7089416 cell line: U2OS time: 4hr stimulation: TWEAK treatment: siRNAs523626 replicate B
V50 SRR7089417 cell line: U2OS time: 4hr stimulation: UnStim treatment: siRNAs523629 replicate B
V51 SRR7089418 cell line: U2OS time: 4hr stimulation: TWEAK treatment: siRNAs523629 replicate B
V52 SRR7089419 cell line: U2OS time: 24hr stimulation: UnStim treatment: NTsiRNA replicate B
V53 SRR7089420 cell line: U2OS time: 24hr stimulation: TWEAK treatment: NTsiRNA replicate B
V55 SRR7089422 cell line: U2OS time: 24hr stimulation: TWEAK treatment: siRNAs523626 replicate B
V56 SRR7089423 cell line: U2OS time: 24hr stimulation: UnStim treatment: siRNAs523629 replicate B
V57 SRR7089424 cell line: U2OS time: 24hr stimulation: TWEAK treatment: siRNAs523629 replicate B

At this point, the experimental protocol can be easily understood for the data. There are seven different treatment levels (BIO032202, BIO0702697, BIO0919278, DMSO, NTsiRNA, siRNAs523626, siRNAs523629), two different stimulation levels (Unstim, TWEAK), and two different time periods for stimulation and treatment (24hr, 4hr). The cell line is consistent for all samples (U2OS). There is a biological replicate (but no technical replicates) for each sample as well.

We can generate subgrouping varibales for each of the experimental conditions with multiple levels (treatment, stimulation, time, replicate). This is done as follows:

se$samplegrouptreatment <- factor(sapply(strsplit(as.character(se$treatment),
                                         "-"), function(x) x[1]))
table(se$samplegrouptreatment)

   treatment: BIO032202   treatment: BIO0702697   treatment: BIO0919278 
                      7                       5                       7 
        treatment: DMSO      treatment: NTsiRNA treatment: siRNAs523626 
                      5                       8                       5 
treatment: siRNAs523629 
                      7 

se$samplegroupdescription <- factor(sapply(strsplit(as.character(se$description),
                                         "-"), function(x) x[1]))
table(se$samplegroupdescription)

replicate A replicate B 
         22          22 

se$samplegrouptime <- factor(sapply(strsplit(as.character(se$time),
                                         "-"), function(x) x[1]))
table(se$samplegrouptime)

time: 24hr  time: 4hr 
        22         22 

se$samplegroupstim <- factor(sapply(strsplit(as.character(se$stimulation),
                                         "-"), function(x) x[1]))
table(se$samplegroupstim)

 stimulation: TWEAK stimulation: UnStim 
                 23                  21 

se$samplegroupcell <- factor(sapply(strsplit(as.character(se$cell_line),
                                         "-"), function(x) x[1]))
table(se$samplegroupcell)

cell line: U2OS 
             44 

2.2 Sequencing Depth

Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order with respect to replicate groups. Figure 2 is with respect to treatment group, Figure 3 with respect to time, and Figure 4 with respect to presence of stimulation. This will be done with respect to every subgroup created above in order to gain as much information as possible about the variation in the data.

Library sizes in increasing order (replicate group).

Figure 1: Library sizes in increasing order (replicate group)

Library sizes in increasing order (treatment).

Figure 2: Library sizes in increasing order (treatment)

Library sizes in increasing order (time).

Figure 3: Library sizes in increasing order (time)

Library sizes in increasing order (stimulation).

Figure 4: Library sizes in increasing order (stimulation)

We see fairly substantial differences in sequencing depth, ranging from 5.89 to 12.13 million reads. It appears that the experimental condition groups, for the most part, are not very related to sequencing depth. There may be one exception - it seems that the samples treated with BIO0919278 were sequenced at a lower depth compared to the other treatments. All of the BIO0919278 samples were sequenced at a depth lower than 8.5 million reads, and the histogram shows that these samples are heavily concentrated to the left.

2.3 Distribution of Expression Levels Among Samples

Figure 5 below shows the distribution of expression values per sample in logarithmic CPM units of expression.

Non-parametric density distribution of expression profiles per sample.

Figure 5: Non-parametric density distribution of expression profiles per sample

Figure 6 below shows the distribution of expression values per sample in logarithmic CPM units of expression.

Non-parametric boxplot of expression profiles per sample.

Figure 6: Non-parametric boxplot of expression profiles per sample

The multidensity plot and boxplot reveal that there are not substantial differences between the samples in the distribution of expression values.

2.4 Distribution of Expression Levels Among Genes

Let’s calculate now the average expression per gene through all the samples.

avgexp <- rowMeans(assays(se)$logCPM)

Figure 7 shows the distribution of those values across genes.

Distribution of average expression level per gene.

Figure 7: Distribution of average expression level per gene

As expected, we have two modes, one for genes that are lowly expressed in nearly all samples and another for genes with some detectable levels of expression across a number of samples.

2.5 Filtering of Lowly-Expressed Genes

We filter lowly-expressed genes using the function filterByExpr(), grouping by sample-group to define the minimum number of samples in which a gene should be expressed. I will choose to use the sample group organized by replicates (samplegroupdescription) since it will eliminate genes that were consistently lowly expressed.

mask <- filterByExpr(dge, group=se$samplegroupdescription)
se.filt <- se[mask, ]
dim(se.filt)
[1] 13228    44
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 13228    44

We are left with 13228 genes.

2.6 Normalization

We will now calculate the normalization factors on the filtered expression data set.

dge.filt <- calcNormFactors(dge.filt)

The following code replaces the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object by the normalized ones.

assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE,
                              normalized.lib.sizes=TRUE)

2.7 MA-Plots

We examine now the MA-plots of the normalized expression profiles in Figure 8.

MA-plots of filtered and normalized expression values.

Figure 8: MA-plots of filtered and normalized expression values

These MA plots plot one sample at a time against the average of the rest of the samples. It should be generally expected that the ratio between M (log ratio) and A (mean) average should be close to 1 since we assume only a small fraction of the genes will be differentially expressed between samples. After viewing the MA plots for each sample, we can see that there may exist intensity dependent bias for many of the samples. First of all, we should keep an eye on samples with these biases in case they also display other unexpected features, because then we might consider removing them. However, we already utilized a between sample normalization technique (TMM). The reason for this was to compensate for any systematic technical effects so that we could attribute the observed differences between the samples to a biological signal. Since the data was already normalized, perhaps the observed differences are due to a biological signal. This is a plausible explanation since this experiment includes so many different experimental conditions (treatment, stimulation, time) and it is likely that these conditions would induce signficant changes in gene expression levels. Regardless, we should continue to be aware of potential biases.

2.8 Experimental Design and Batch Identification

We have alerady described the experimental design above. All combinations of cell line, treatment, stimulation, and stimulation time have been sequenced. However, not every biological replicate for each unique combination was sequenced. For this reason, we can anticipate that it won’t be possible to identify expression changes associated with the replicate sample group. Additionally, though the experimental protocol was the same for the replicates, it is possible that there exist batch effects related to the replicates that we will be unable to discern. This means that all differences between replicate A and replicate B samples may not just be due to biological differences but also logistical (time of sequencing, etc.).

We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating sample group and treatment. We calculate again log CPM values with a high prior count(3) to moderate extreme fold-changes produced by low counts. Visualizing the clustering of samples separated by replicate group as well as experimental conditions should help us to estimate potential batch effects. The same procedure (shown below) is followed for each sample group:


logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(se.filt$samplegrouptreatment)
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(se.filt$treatment, colnames(se), sep="\n")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
                               function(x, batch, labels) {
                                 if (is.leaf(x)) {
                                   attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
                                   attr(x, "label") <- as.vector(labels[attr(x, "label")])
                                 }
                                 x
                               }, batch, outcome)

The resulting dendrograms for each sample group are shown below. Figure 9 is with respect to treatment. Figure 10 is with respect to time. Figure 11 is with respect to stimulation. Figure 12 is with respect to replication.

Hierarchical clustering of the samples (treatment).

Figure 9: Hierarchical clustering of the samples (treatment)

Hierarchical clustering of the samples (time).

Figure 10: Hierarchical clustering of the samples (time)

Hierarchical clustering of the samples (stimulation) .

Figure 11: Hierarchical clustering of the samples (stimulation)

Hierarchical clustering of the samples (replication group).

Figure 12: Hierarchical clustering of the samples (replication group)

As expected, the samples for each of the different treatments tend to cluster separately from eachother. Additionally, it seems that samples for the two different stimulation/treatment times also cluster together. Therefore, it is a reasonable possibility that treatment and stimulation/treatment time seem to drive the largest portion of the variablity in the whole dataset. There is not much clustering for samples with respect to replicates or stimulation. However, since we know that the stimulant (TWEAK) activates noncanonical NF-kB signaling, we will continue to take into account the potential effects of stimulation on differential gene expression. Overall, these results may imply that there is not a strong batch effect related to replicates, as was previously hypothesized. In Figure 13 we show the corresponding MDS plot related to treatment and Figure 14 related to stimulation time, since those variables were the ones that showed the most significant clustering.

Multidimensional scaling plot of the samples (treatment)

Figure 13: Multidimensional scaling plot of the samples (treatment)

Multidimensional scaling plot of the samples (time)

Figure 14: Multidimensional scaling plot of the samples (time)

The MDS plot for tereatment shows clear differences between BIO0919278 treatments and all other treatments. It also reinforces the fact that there is clear clustering of samples by treatment. The MDS plot for stimulation time is a bit less clear. There is a clear difference for some of the treatments that were stimulated for 24 hrs, however, this is not consistent and there is not a strong clustering pattern with respect to stimulation time. It is a more reasonable explanation that the clustering seen from stimulation time can more accurately be attributed to treatment. For this reason, in this dataset it makes sense to compare different treatment conditions. Specifically, we are most interested in discerning the difference in gene expression between the two enantiomer treatments (BIO0919278 and BIO0702697). The MDS plot with respect to treatment indicates that we should expect differential gene expression between the two enantiomers.

3 Differential Expression

3.1 Preliminary Analysis (F Test)

We will first perform a simple assessment of the extent of expression changes and their associated p-values using the F-test implemented in the R/Bioconductor package sva. We compare treatment with BIO0919278 and its enantiomer BIO0702697. We first subset the data as follows:

se.filt.enants <- se.filt[, se.filt$treatment == "treatment: BIO0919278" | se.filt$treatment == "treatment: BIO0702697"]
se.filt.enants$treatment <- droplevels(se.filt.enants$treatment)
length(rownames(colData(se.filt.enants)))
[1] 12
se.filt.enants$treatment
                   V6                    V7                   V13 
treatment: BIO0919278 treatment: BIO0919278 treatment: BIO0702697 
                  V14                   V15                   V20 
treatment: BIO0919278 treatment: BIO0919278 treatment: BIO0702697 
                  V21                   V23                   V28 
treatment: BIO0702697 treatment: BIO0919278 treatment: BIO0702697 
                  V29                   V30                   V31 
treatment: BIO0702697 treatment: BIO0919278 treatment: BIO0919278 
Levels: treatment: BIO0702697 treatment: BIO0919278

In the second step above, we dropped the unused levels from the treatment factor variable. This is important to avoid using samples with treatments that do not exist in this subset of the data. We can see that this new dataset includes 12 samples, and looking at the treatment data we can see that 7 were treated with BIO0919278 and 5 were treated with BIO0702697.We can now build the corresponding full and null model matrices. Based on the results from the MDS plot earlier in the analysis, I have decided not to adjust for known covariates. The plots did not show any clear clustering patterns based on stimulation or time, and so I am opting to build my matricess with the unknown covariate methodology as follows:

mod <- model.matrix(~ se.filt.enants$treatment,
                    colData(se.filt.enants))
mod0 <- model.matrix(~ 1, colData(se.filt.enants))

Finally, we conduct the F-test implemented in the package sva and examine the amount of differential expression between treatment with BIO0919278 and its enantiomer BIO0702697. I have set the p values as 0.01 and 0.05 in order to reduce the number of DEGs reported since there are quite a high number.

library(sva)

pv <- f.pvalue(assays(se.filt.enants)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.01)
[1] 4080
sum(p.adjust(pv, method="fdr") < 0.05)
[1] 6386

We obtain 4080 differentially expressed genesn (DEGs) at FDR < 1% and 6386 at FDR < 5%. In Figure 15 below we can see the distribution of the resulting p-values.

Distribution of raw p-values for an F-test on every gene between treatment with BIO0919278 vs BIO0702697.

Figure 15: Distribution of raw p-values for an F-test on every gene between treatment with BIO0919278 vs BIO0702697

The p-values are quite uniform, though there is some slight enrichment at lower p-values. This may indicate some potential confounding factors that are influencing the results. We build a table with the subset of DEGs with FDR < 5% and show the top-10 genes with lowest p-value in Table ?? below.

This was a very basic preliminary differential expression analysis. The analysis that we have conducted so far is “muddy” in the sense that it has failed to separate the samples by the presence of stimulation or other experimental conditions. In a biological sense, the presence or absence of stimulation should theoretically significantly affect the genes that are considered differentially expressed. The next step in the differential expression analysis will aim to more strongly account for confounding experimental conditions so that we can elucidate the true DEGs related to enantiomer treatment.

3.2 Limited Replication Analysis

The next step in the differential expression analysis is to try to elucidate the DEGs that are relevant to the enantiomer treatments, while controlling for other experimental conditions (stimulation, time). Initially, we attempted to separate the samples into multiple comparison groups by both stimulation and time. However, it was found that this dramatically reduced the number of samples in comparison groups, with some groups having only one sample. This method signficantly reduced the statistical power, and we were not confident in the list of DEGs that we received. Instead, the analyses are separated by stimulation (TWEAK vs no stimulation) so that we can eliminate the effects of stimulationn conditions when searching for DEGs between the enantiomer treatments. Thinking biologically, we decided it was more important to separate groups based on the presence of stimulation rather than time, since the TWEAK stimulant directly effects the non-canonical signaling pathway of interest. Two different comparison groups are created (dgeenantstim and dgeenantnostim) for subsequent DGE analyses.

Further, since the number of samples in each comparison group is rather small (between 5 and 7), we need to utilize limited replication technique. The following procedure is applied:


# Subsetting data

levels(se.filt.enants$stimulation) <- sub("stimulation: TWEAK", "TWEAK", levels(se.filt.enants$stimulation))
levels(se.filt.enants$stimulation) <- sub("stimulation: UnStim", "UnStim", levels(se.filt.enants$stimulation))
levels(se.filt.enants$treatment) <- sub("treatment: BIO0919278", "BIO0919278", levels(se.filt.enants$treatment))
levels(se.filt.enants$treatment) <- sub("treatment: BIO0702697", "BIO0702697", levels(se.filt.enants$treatment))
se.filt.enants.tweak <- se.filt.enants[, se.filt.enants$stimulation == "TWEAK"]
se.filt.enants.nostim <- se.filt.enants[, se.filt.enants$stimulation == "UnStim"]
se.filt.enants.nostim$treatment <- droplevels(se.filt.enants.nostim$treatment)
se.filt.enants.tweak$treatment <- droplevels(se.filt.enants.tweak$treatment)
se.filt.enants.tweak$treatment
        V7        V13        V15        V21        V23        V29        V31 
BIO0919278 BIO0702697 BIO0919278 BIO0702697 BIO0919278 BIO0702697 BIO0919278 
Levels: BIO0702697 BIO0919278
se.filt.enants.nostim$treatment
        V6        V14        V20        V28        V30 
BIO0919278 BIO0919278 BIO0702697 BIO0702697 BIO0919278 
Levels: BIO0702697 BIO0919278

# Subsetting DGEList

dgeenantstim <- DGEList(counts=assays(se.filt.enants.tweak)$counts, genes=data.frame(Symbol=mcols(se.filt.enants.tweak)$symbol))
dgeenantstim <- calcNormFactors(dgeenantstim)
dgeenantnostim <- DGEList(counts=assays(se.filt.enants.nostim)$counts, genes=data.frame(Symbol=mcols(se.filt.enants.nostim)$symbol))
dgeenantnostim <- calcNormFactors(dgeenantnostim)

#Comparison of Enantiomer Treatments (TWEAK, no covariates)

mod2 <- model.matrix(~ se.filt.enants.tweak$treatment, colData(se.filt.enants.tweak))
dgeenantstim <- estimateDisp(dgeenantstim, mod2)
fit2 <- glmQLFit(dgeenantstim, mod2)
qlf2 <- glmQLFTest(fit2, coef=2)
tt2 <- topTags(qlf2, n=Inf)
mod2
           (Intercept) se.filt.enants.tweak$treatmentBIO0919278
SRR7089374           1                                        1
SRR7089380           1                                        0
SRR7089382           1                                        1
SRR7089388           1                                        0
SRR7089390           1                                        1
SRR7089396           1                                        0
SRR7089398           1                                        1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`se.filt.enants.tweak$treatment`
[1] "contr.treatment"

#Comparison of Enantiomer Treatments (UnStim, no covariates)

mod3 <- model.matrix(~ se.filt.enants.nostim$treatment, colData(se.filt.enants.nostim))
dgeenantnostim <- estimateDisp(dgeenantnostim, mod3)
fit3 <- glmQLFit(dgeenantnostim, mod3)
qlf3 <- glmQLFTest(fit3, coef=2)
tt3 <- topTags(qlf3, n=Inf)

# Surrogate Variable Testing 

mod01 <- model.matrix(~ 1, colData(se.filt.enants.tweak))
sv1 <- sva(assays(se.filt.enants.tweak)$logCPM, mod=mod2, mod0=mod01)
Number of significant surrogate variables is:  1 
Iteration (out of 5 ):1  2  3  4  5  
sv1$n
[1] 1

mod02 <- model.matrix(~ 1, colData(se.filt.enants.nostim))
sv2 <- sva(assays(se.filt.enants.nostim)$logCPM, mod=mod3, mod0=mod02)
Number of significant surrogate variables is:  1 
Iteration (out of 5 ):1  2  3  4  5  
sv2$n
[1] 1

#Comparison of Enantiomer Treatments (TWEAK, unknown covariates)

mod5 <- model.matrix(~ se.filt.enants.tweak$treatment, colData(se.filt.enants.tweak))
dgeenantstim <- estimateDisp(dgeenantstim, mod5)
fit5 <- glmQLFit(dgeenantstim, mod5)
qlf5 <- glmQLFTest(fit5, coef=2)
tt5 <- topTags(qlf5, n=Inf)

#Comparison of Enantiomer Treatments (UnStim, unknown covariates)

mod4 <- model.matrix(~ se.filt.enants.nostim$treatment, colData(se.filt.enants.nostim))
dgeenantnostim <- estimateDisp(dgeenantnostim, mod4)
fit4 <- glmQLFit(dgeenantnostim, mod4)
qlf4 <- glmQLFTest(fit4, coef=2)
tt4 <- topTags(qlf4, n=Inf)

We can see that the tweak comparison group includes 7 samples with 4 treated with BIO919278 and 3 treated with BIO0702697. The nostim comparison group includes 5 samples, 3 of which were treated with BIO919278 and 2 treated with BIO0702697. This analysis was conducted both with no covariates and unknown covariates. We attempted to adjust for known covariates (time), but there was an error that the design matrix was not of full rank. This is due to the fact that we are using a limited replication analysis and we simply do not have enough data points. There was only one surrogate variable presented from the surrogate variable testing, which led us to believe that adjusting for unknown covariates would not be as beneficial.

Below we see a preview of the most signicant DEGs found using the limited replication differential expression analysis with respect to TWEAK stimulation. The previews are presented in the same order as the code below. Additionally, we arw saving the DEGs with FDR < 0.01.

# Differential Expression Table (TWEAK)

head(tt2,10)
Coefficient:  se.filt.enants.tweak$treatmentBIO0919278 
        Symbol     logFC   logCPM        F       PValue         FDR
6752     SSTR2 -3.263151 4.722807 220.0516 4.201147e-07 0.002216083
643836   ZFP62 -2.698245 3.880877 192.3153 7.071488e-07 0.002216083
7741   ZSCAN26 -3.619425 2.603803 187.8349 7.744089e-07 0.002216083
1958      EGR1  4.861429 9.401018 185.2372 8.170829e-07 0.002216083
64783    RBM15  2.275947 6.348114 180.1252 9.100032e-07 0.002216083
9807     IP6K1 -2.208667 5.194190 167.0127 1.216537e-06 0.002216083
54838    WBP1L -2.303788 5.045818 166.2091 1.239248e-06 0.002216083
10363   HMG20A -2.032194 5.217478 154.0232 1.658651e-06 0.002216083
2736      GLI2 -2.255932 5.267824 149.2553 1.870308e-06 0.002216083
7779   SLC30A1  2.615566 8.220260 138.2207 2.506077e-06 0.002216083
sum(tt2$table$FDR < 0.01)
[1] 571
sum(tt2$table$FDR < 0.05)
[1] 3571
tweakDEG <- head(tt2, 571)

#Differential Expression Table (UnStim)

head(tt3, 10)
Coefficient:  se.filt.enants.nostim$treatmentBIO0919278 
        Symbol     logFC   logCPM         F       PValue        FDR
1958      EGR1  5.596727 9.471384 115.53702 7.867152e-06 0.06248152
1959      EGR2  5.092510 4.427130  94.29537 1.612773e-05 0.06248152
2296     FOXC1  3.025920 5.384855  76.69481 3.313976e-05 0.06248152
6242      RTKN -3.136723 3.422720  75.67917 3.470290e-05 0.06248152
3726      JUNB  5.394030 8.506892  70.92467 4.339895e-05 0.06248152
205428 C3orf58  2.842493 5.790873  68.28324 4.943551e-05 0.06248152
8553   BHLHE40  2.863960 8.232418  64.80958 5.909282e-05 0.06248152
93517  SDR42E1 -2.308680 4.039448  60.04808 7.658122e-05 0.06248152
2736      GLI2 -2.430199 5.426803  59.02270 8.117441e-05 0.06248152
64783    RBM15  2.184114 6.393603  58.57774 8.327602e-05 0.06248152
sum(tt3$table$FDR < 0.05)
[1] 0
sum(tt3$table$FDR < 0.1)
[1] 2190

#Differential Expression Table (TWEAK, unknown cov)

head(tt5,10)
Coefficient:  se.filt.enants.tweak$treatmentBIO0919278 
        Symbol     logFC   logCPM        F       PValue         FDR
6752     SSTR2 -3.263151 4.722807 220.0516 4.201147e-07 0.002216083
643836   ZFP62 -2.698245 3.880877 192.3153 7.071488e-07 0.002216083
7741   ZSCAN26 -3.619425 2.603803 187.8349 7.744089e-07 0.002216083
1958      EGR1  4.861429 9.401018 185.2372 8.170829e-07 0.002216083
64783    RBM15  2.275947 6.348114 180.1252 9.100032e-07 0.002216083
9807     IP6K1 -2.208667 5.194190 167.0127 1.216537e-06 0.002216083
54838    WBP1L -2.303788 5.045818 166.2091 1.239248e-06 0.002216083
10363   HMG20A -2.032194 5.217478 154.0232 1.658651e-06 0.002216083
2736      GLI2 -2.255932 5.267824 149.2553 1.870308e-06 0.002216083
7779   SLC30A1  2.615566 8.220260 138.2207 2.506077e-06 0.002216083
sum(tt5$table$FDR < 0.01)
[1] 571
sum(tt5$table$FDR < 0.05)
[1] 3571

#Differential Expression Table (UnStim, unknown cov)

head(tt4,10)
Coefficient:  se.filt.enants.nostim$treatmentBIO0919278 
        Symbol     logFC   logCPM         F       PValue        FDR
1958      EGR1  5.596727 9.471384 115.53702 7.867152e-06 0.06248152
1959      EGR2  5.092510 4.427130  94.29537 1.612773e-05 0.06248152
2296     FOXC1  3.025920 5.384855  76.69481 3.313976e-05 0.06248152
6242      RTKN -3.136723 3.422720  75.67917 3.470290e-05 0.06248152
3726      JUNB  5.394030 8.506892  70.92467 4.339895e-05 0.06248152
205428 C3orf58  2.842493 5.790873  68.28324 4.943551e-05 0.06248152
8553   BHLHE40  2.863960 8.232418  64.80958 5.909282e-05 0.06248152
93517  SDR42E1 -2.308680 4.039448  60.04808 7.658122e-05 0.06248152
2736      GLI2 -2.430199 5.426803  59.02270 8.117441e-05 0.06248152
64783    RBM15  2.184114 6.393603  58.57774 8.327602e-05 0.06248152
sum(tt4$table$FDR < 0.05)
[1] 0
sum(tt4$table$FDR < 0.10)
[1] 2190

We obtain 571 differentially expressed genes (DEGs) at FDR < 1% and 3571 at FDR < 5% with TWEAK stimulation. In Figure 16 below we can see the distribution of the resulting p-values.

We obtain 0 differentially expressed genes (DEGs) at FDR < 5% and 2190 at FDR < 10% with no stimulation. In Figure 17 below we can see the distribution of the resulting p-values.

We receive the exact same results using unknown covariates. Therefore, the rest of the analysis will only include the results from the no covariate group. Additionally, since there are 0 DEGs in the unstimulated condition group with FDR < 5%, we will exclude these results from further analysis. Further, it makes biological sense that there are significantly fewer DEGs in the unstimulated condition since only the stimulated conditions are directly affecting the non-canonical signaling pathway of interest.

Lastly, we have saved the DEGs with FDR < 0.01 to be utilized further in this differential expression analysis pipeline.

Distribution of raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with TWEAK stimulation.

Figure 16: Distribution of raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with TWEAK stimulation

Distribution of raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with no stimulation.

Figure 17: Distribution of raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with no stimulation

The p-value distributions for both the stimulated and unstimulated conditions look fairly similar. However, it appears that the p-values are slightly more uniformly distributed in the stimulated condition. This distinction provides further evidence for using DEGs from the stimulated comparison group rather than the unstimulated comparison group.

3.3 Fold Change Analysis

Below, a new data frame is created that includes the fold change values (Log2FC, FC, 1/FC) as well as statistical indicators (p-values, FDR). This data frame will be utilized to generate a diagnostic volcano plot further in the analysis pipeline. A preview of the table is presented below, and the following procedure was followed:


# Log Fold Analysis and Filtering

ranking2 <- order(abs(tt2[["table"]][["logFC"]]), decreasing=TRUE)
tab <- data.frame(Gene=rowData(se.filt.enants.tweak)$symbol[ranking2],
                Log2FC=round(tt2[["table"]][["logFC"]][ranking2], digits=10),
                FC=round(2^tt2[["table"]][["logFC"]][ranking2], digits=10),
                "1/FC"=round(2^(-tt2[["table"]][["logFC"]][ranking2]), digits=10),
                Pvalues=round(tt2[["table"]][["PValue"]][ranking2], digits=10),
                FDR=round(tt2[["table"]][["FDR"]][ranking2], digits=10),
                row.names=rownames(tt2),
                check.names=FALSE)
head(tab)
           Gene   Log2FC       FC        1/FC      Pvalues         FDR
6752     ZNF256 8.524302 368.1888 0.002715998 0.0002004338 0.007303074
643836    SPC24 7.946608 246.6989 0.004053524 0.0040187656 0.025111115
7741   CDC42EP2 7.849888 230.7022 0.004334592 0.0005264471 0.010963274
1958     SEMA4B 7.770743 218.3869 0.004579029 0.0006131388 0.011587581
64783      FNTA 7.194514 146.4753 0.006827088 0.0137626519 0.050668622
9807     GPR180 6.875140 117.3879 0.008518766 0.0051539447 0.028525682

Figure 18 shows a volcano plot of log fold change values by raw p-values between genes treated with BIO0919278 and genes treated with BIO0702697 in the TWEAK stimulation group.

Volcano plot of log fold change values by raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with TWEAK stimulation.

Figure 18: Volcano plot of log fold change values by raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with TWEAK stimulation

The volcano plot above includes a horizontal FDR cutoff line of FDR < 0.01. The points above this line represent the 571 significant differentially expressed genes previously discussed above. This plot shows that the majority of these significant DEGs have negative log fold change values. With respect to this set of DEGs, this indicates biologically that treatment with BIO919278 tends produce underexpressed genes relative to treatment with BIO0702697.

3.4 Table of DEGs

We will also build a table with the subset of DEGs with FDR < 1% and show the top-10 genes with lowest p-value in Table 2 below.

Table 2: Differentially expressed genes. Top-10 differentially expressed genes with lowest p-value between the two enantiomers with FDR < 1% with TWEAK stimulation. To see the full list of DE genes, follow this link or download this CSV file.
Symbol logFC logCPM F PValue FDR
6752 SSTR2 -3.263151 4.722807 220.0516 4.0e-07 0.0022161
643836 ZFP62 -2.698245 3.880877 192.3153 7.0e-07 0.0022161
7741 ZSCAN26 -3.619425 2.603803 187.8349 8.0e-07 0.0022161
1958 EGR1 4.861429 9.401018 185.2372 8.0e-07 0.0022161
64783 RBM15 2.275947 6.348114 180.1252 9.0e-07 0.0022161
9807 IP6K1 -2.208667 5.194191 167.0127 1.2e-06 0.0022161
54838 WBP1L -2.303788 5.045818 166.2091 1.2e-06 0.0022161
10363 HMG20A -2.032194 5.217478 154.0232 1.7e-06 0.0022161
2736 GLI2 -2.255932 5.267824 149.2553 1.9e-06 0.0022161
7779 SLC30A1 2.615566 8.220260 138.2207 2.5e-06 0.0022161

4 Functional Enrichment Analysis

4.1 Gene Ontology Analysis

Below, we are preparing the data for the GO analysis by isolating the rownmames from our DEG list.


DEGrows <- rownames(tweakDEG)
length(DEGrows)
[1] 571

There are 571 genes in the stimulated comparison with FDR < 0.01.

The goal of this analysis is to elucidate the biological functions associated with these genes. We are specifically interested in viewing DEGs related to the molecular processes examined in the original article (noncanonical NF-kB signaling pathway). The gene universe provided in the GO analysis includes the genes in the se.filt.enants data set. The resulting table below provides a preview of the results for the GO analysis. Additionally, we have created an HTML page with the full results of the analysis. The following procedure was followed:


library(org.Hs.eg.db)
allHumanGO <- select(org.Hs.eg.db, columns="GO", key=keys(org.Hs.eg.db, keytype="SYMBOL"), keytype="SYMBOL")
sort(table(allHumanGO$EVIDENCE), decreasing=TRUE)

  IEA   IDA   IBA   TAS   ISS   IMP   IPI   NAS   HDA    ND   IGI   ISA    IC 
77387 62996 54631 43106 24288 20213 15429  7682  7366  1836  1574  1445  1299 
  IEP   ISM   EXP   RCA   HMP   HEP   ISO 
  878   757   293   133   130    70     7 
geneUniverse <- rownames(se.filt.enants)

# conditional non-filtered DEG
library(GOstats)
params1 <- new("GOHyperGParams", geneIds=DEGrows,    
              universeGeneIds=geneUniverse,
              annotation="org.Hs.eg.db", ontology="BP",
              pvalueCutoff=0.05, testDirection="over")
conditional(params1) <- TRUE
hgOverCond <- hyperGTest(params1)

goresults <- summary(hgOverCond)
goresults <- goresults[goresults$Size >= 3 & goresults$Size <= 500 & goresults$Count >= 3, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]

geneIDs <- geneIdsByCategory(hgOverCond)[goresults$GOBPID]
geneSYMs <- sapply(geneIDs, function(id) select(org.Hs.eg.db, columns="SYMBOL", key=id, keytype="ENTREZID")$SYMBOL)
geneSYMs <- sapply(geneSYMs, paste, collapse=", ")
goresults <- cbind(goresults, Genes=geneSYMs)
rownames(goresults) <- 1:nrow(goresults)

library(kableExtra)

use_directory(file.path("doc"))

## generate full table in HTML and store it into the 'doc' directory
ktab <- kable(goresults, "html", caption="GO results for conditional analysis in TWEAK stimulated group (FDR < 0.01).")
ktab <- kable_styling(ktab, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
fnameHTML <- "GOresults.html"
fpathHTML <- proj_path(file.path("doc", fnameHTML))
save_kable(ktab, file=fpathHTML, self_contained=TRUE)
browseURL("GOresults.html")
ktabshow <- kable(goresults[1:10,])
ktabshow <- kable_styling(ktabshow, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
ktabshow
GOBPID Pvalue OddsRatio ExpCount Count Size Term Genes
GO:0007296 0.0003406 64.67251 0.1782999 3 4 vitellogenesis ETV6, ZMIZ1, SOS1
GO:1903265 0.0002367 21.59375 0.3565999 4 8 positive regulation of tumor necrosis factor-mediated signaling pathway TRIM32, HSPA1A, HSPA1B, RIPK1
GO:0061626 0.0026940 16.16374 0.3120249 3 7 pharyngeal arch artery morphogenesis FOLR1, BMPR1A, NOG
GO:0003197 0.0040787 13.03465 0.3538781 3 8 endocardial cushion development RBM24, FOXF1, GATA4
GO:0003161 0.0041683 12.92982 0.3565999 3 8 cardiac conduction system development ID2, BMPR1A, MAML1
GO:0010944 0.0041683 12.92982 0.3565999 3 8 negative regulation of transcription by competitive promoter binding CREB1, SMAD7, BHLHE41
GO:0048617 0.0041683 12.92982 0.3565999 3 8 embryonic foregut morphogenesis FOXF1, GATA4, ACVR2B
GO:0070424 0.0041683 12.92982 0.3565999 3 8 regulation of nucleotide-binding oligomerization domain containing signaling pathway BIRC3, HSPA1A, HSPA1B
GO:0071550 0.0041683 12.92982 0.3565999 3 8 death-inducing signaling complex assembly TNFRSF1A, CASP8, RIPK1
GO:1904748 0.0041683 12.92982 0.3565999 3 8 regulation of apoptotic process involved in development FOXC1, TNFRSF1A, VDR

5 Discussion

The original article is primarily concerned with the noncanonical NF-kB signaling pathway and ways in which this pathway is molecularly regulated. This pathway plays an important role in a wide range of critical biological functions, including development and homeostatic control of the immune system. The paper identified molecule BIO0919278 as an activator of this pathway via inhibiting CDK12. The primary focus of this analysis was determine gene expression differences between cellular treatment with BIO0919278 and its (S) enantiomer BIO0702697 and uncover biological functions related to those differentially expressed genes.

The GO results from this analysis seem to provide reasonable evidence that verify some of the results claimed in the original paper. Since BIO0702697 was shown by the authors to be significantly less potent compared to BIO0919278, it was intended to serve somewhat as a negative control. An initial scan of the GO results reveal many biological pathways directly associated with various forms of cellular development development (regulation of apoptotic process involved in development, cardiac conduction system development, endocardial cushion development, notochord development, metanephric mesenchyme development, etc.). These results, though general, should be expected since BIO919278 activates noncanonical NF-kB signaling, which is crucial in various development pathways. Additionally, these GO results represent genes that are overexpressed in BIO0919278 relative to BIO702697. Therefore, the presence of developmental pathways in the results provides support for the idea that BIO0919278 is a much more potent activator of the noncanonical NF-kB signaling pathway relative to its (S) enantiomer.

On a more specific level, the authors were particularly interested in RNA polymerase (Pol) II–mediated transcription and the role in which BIO0919278 may play in this pathway. It was hypothesized that BIO0919278 targets and inhibits CDK12, which itself has been previously shown to activate RNA polymerase (Pol) II–mediated transcription. The GO results identify the RNA polymerase (Pol) II–mediated transcription pathway as signficant with respect to BIO0919278 with an odds ratio = 3.08. This provides concrete evidence for the role of BIO0919278 in targeting CDK12 and ultimately regulating this pathway. This pathway is especially important since many previous studies have identified CDK12 as an emerging therapeutic target for cancer due to its role in regulating cellular transcription.

Overall, the functional enrichment results from this analysis serve to verify the results presented in the initial article. There is a growing body of evidence that BIO0919278 ia an inhibitor of CDK12, and as such could potentially be a compound of interest in developing cancer therapeutics. Further research must be conducted in order to more accurately assess cellular changes induced by BIO0919278. Such experiments should include multiple types of analyses (RNASeq transcriptome analysis, chemoproteomics, confocal microscopy, etc.) in order to quantitatively classify the extent of BIO0919278’s involvement in the aforementioned biological pathways.

6 Session information

sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS/LAPACK: /Users/zacharycollester/opt/miniconda3/envs/rbase/lib/libopenblasp-r0.3.9.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GO.db_3.11.4                GOstats_2.54.0             
 [3] graph_1.66.0                Category_2.54.0            
 [5] Matrix_1.2-18               org.Hs.eg.db_3.11.4        
 [7] sva_3.36.0                  BiocParallel_1.22.0        
 [9] genefilter_1.70.0           mgcv_1.8-31                
[11] nlme_3.1-148                geneplotter_1.66.0         
[13] annotate_1.66.0             XML_3.99-0.3               
[15] AnnotationDbi_1.50.0        lattice_0.20-41            
[17] edgeR_3.30.0                limma_3.44.1               
[19] SummarizedExperiment_1.18.1 DelayedArray_0.14.0        
[21] matrixStats_0.56.0          Biobase_2.48.0             
[23] GenomicRanges_1.40.0        GenomeInfoDb_1.24.0        
[25] IRanges_2.22.1              S4Vectors_0.26.0           
[27] BiocGenerics_0.34.0         usethis_1.6.1              
[29] kableExtra_1.1.0            knitr_1.28                 
[31] BiocStyle_2.16.0           

loaded via a namespace (and not attached):
 [1] bitops_1.0-6           fs_1.4.1               bit64_4.0.2           
 [4] webshot_0.5.2          RColorBrewer_1.1-2     httr_1.4.1            
 [7] rprojroot_1.3-2        Rgraphviz_2.32.0       tools_4.0.0           
[10] backports_1.1.7        R6_2.4.1               KernSmooth_2.23-17    
[13] DBI_1.1.0              colorspace_1.4-1       bit_4.0.4             
[16] compiler_4.0.0         cli_2.0.2              rvest_0.3.5           
[19] xml2_1.3.2             bookdown_0.20          scales_1.1.1          
[22] readr_1.3.1            RBGL_1.64.0            stringr_1.4.0         
[25] digest_0.6.25          rmarkdown_2.3          AnnotationForge_1.30.1
[28] XVector_0.28.0         pkgconfig_2.0.3        htmltools_0.4.0       
[31] highr_0.8              rlang_0.4.6            rstudioapi_0.11       
[34] RSQLite_2.2.0          RCurl_1.98-1.2         magrittr_1.5          
[37] GenomeInfoDbData_1.2.3 Rcpp_1.0.4.6           munsell_0.5.0         
[40] fansi_0.4.1            lifecycle_0.2.0        stringi_1.4.6         
[43] yaml_2.2.1             zlibbioc_1.34.0        grid_4.0.0            
[46] blob_1.2.1             crayon_1.3.4           splines_4.0.0         
[49] hms_0.5.3              locfit_1.5-9.4         pillar_1.4.4          
[52] glue_1.4.1             evaluate_0.14          BiocManager_1.30.10   
[55] vctrs_0.3.0            assertthat_0.2.1       xfun_0.14             
[58] xtable_1.8-4           survival_3.1-12        viridisLite_0.3.0     
[61] tibble_3.0.1           memoise_1.1.0          ellipsis_0.3.1        
[64] GSEABase_1.50.0       

References

Henry, Kate L., Debra Kellner, Bekim Bajrami, John E. Anderson, Mercedes Beyna, Govinda Bhisetti, Tom Cameron, et al. 2018. “CDK12-Mediated Transcriptional Regulation of Noncanonical Nf-κB Components Is Essential for Signaling.” Science Signaling 11 (541). https://doi.org/10.1126/scisignal.aam8216.

Maglott, Donna, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. 2010. “Entrez Gene: Gene-Centered Information at Ncbi.” Nucleic Acids Research 39 (suppl_1): D52–D57. https://doi.org/10.1093/nar/gkq1237.

Ziemann, Mark, Antony Kaspi, and Assam El-Osta. 2019. “Digital Expression Explorer 2: A Repository of Uniformly Processed Rna Sequencing Data.” GigaScience 8 (4): giz022. https://doi.org/10.1093/gigascience/giz022.